Navya Yarrabelly,
International Institute of Information and Technology - Hyderabad, yarrabelly.navya@students.iiit.ac.in
PRIMARY
P Yashaswi, International Institute of Information and Technology - Hyderabad, p.yashaswi@students.iiit.ac.in
Veera Raghavendra Chikka, International Institute of Information and Technology
- Hyderabad, raghavendra.ch@research.iiit.ac.in
Kamalakar Karlapalem (Advisor), International Institute of Information and
Technology - Hyderabad, kamal@iiit.ac.in
Student Team: YES
QGIS, for plotting the geospatial data
locations and paths
R, to analyze the graphs
D3.js
JavaScript
TwitInfo, adapted by the team for the challenge
Weka, for clustering and analyzing the clusters visually
Stanford NER, for named entity extraction
Tweet NLP, for POS tagging of microblog data
Approximately how many hours were spent working on this submission
in total?
150 hours
May we post your submission in the Visual Analytics Benchmark
Repository after VAST Challenge 2014 is complete?
YES
Video:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Questions
Please note - this challenge
contains a question that is time-dependent. Within 3 hours of starting the
final data stream, send an email to VASTChal2014MC3@vacommunity.org containing your answer to question
MC3.1. Please include a copy of your answer to MC3.1 in your final answer
form also. Your answers to MC3.2 and MC3.3, along with your video, are due July
8.
The
responses to these questions should be incrementally built, as you (the
contestant) acquire information from each streaming data segment you
receive. Your submission will answer these questions in consideration of
all of the streaming data segments.
MC3.1 - Within 3 hours after the start of the
final data stream, send an email to VASTChal2014MC3@vacommunity.org
containing:
1. An image showing the streaming data in your
visual analytics tool. In this image, identify an event of interest that you
intend to investigate further.
2. The content of the final message in the data
stream
a. The identified event is "Fire at Dancing Dolphin". In Figure 1, clicking the
event highlights all its occurrences in the stream and shows the
event-related information.
Figure 1: Image showing the streaming data along with the trends (bigrams),
sized from large to small and coloured on a gradient from red to green in
order of popularity.
b. The content of the final message in the stream is: "RT @KronosStar There has
been an explosion from inside the apartment building. Several people are down.
#KronosStar #DancingDolphinFire #AFDHeroes".
MC 3.2 -
Describe the
timeline of up to five major events that you discover in the streaming data.
This timeline should include information from all three segments of the data
stream if needed. Use specific microblog messages and call center
data to support your description, but do not simply mimic back the data
stream. Provide a concise description of important participants,
locations and durations. Focus your response on the events themselves,
rather than on the individuals reporting the events. Please limit your answer
to no more than ten images and 1500 words.
1. Event Detection:
To detect events in the given data, we used an adaptation of the 'bursty words'
algorithm.
1. From the given microblog messages, we identified a set 'S' of textual features: bigrams that occur frequently over the given time period, with a minimum frequency threshold 'T'.
2. Each microblog message is treated as a transaction (item set) whose items are the identified textual features; non-textual features are ignored. From this set of transactions, we found another set 'D' of disjoint frequent item sets, i.e. features that co-occurred frequently in the microblog messages, with a frequency above 3*T/4.
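3. We clustered the set 'D' using the following similarity metric.
Let (k1, k2, k3) be a transaction T1 and (k2, k4) be a transaction T2,
let n(ki, kj) be the number of microblog messages in which both features ki
and kj occur, and let n(ki) be the number of messages in which ki occurs. Then

\[
\mathrm{Similarity}(T_1, T_2) =
\frac{n(k_1,k_2) + n(k_1,k_4) + n(k_2,k_2) + n(k_2,k_4) + n(k_3,k_2) + n(k_3,k_4)}
     {n(k_1) + n(k_2) + n(k_3) + n(k_4)}
\]

The numerator sums n over every pair formed by one feature of T1 and one
feature of T2, so the shared feature k2 contributes n(k2, k2) = n(k2).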
At the end of the clustering process, each cluster represents one event. From
these clusters we picked the five largest, each representing an event; each
point in a cluster corresponds to a microblog message that describes the event.
All the data points in a cluster are taken as event-related transactions. Using
the mapping from messages to transactions, all microblog messages whose
corresponding transactions belong to the event cluster are taken as
event-related microblog data. Each event is then named by the most frequent
bigram in its related microblog data.
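A minimal sketch of steps 1 and 3 above may help make the procedure concrete. It assumes whitespace tokenisation and reads the similarity as the sum of co-occurrence counts over all feature pairs drawn from the two item sets, divided by the summed counts of the individual features. The function names and this reading of the formula are illustrative assumptions, not the code used for the submission.

```python
from collections import Counter
from itertools import product


def bigrams(message):
    """Return the set of adjacent-word bigrams in one message."""
    words = message.lower().split()
    return set(zip(words, words[1:]))


def frequent_bigrams(messages, threshold):
    """Step 1 (sketch): bigrams occurring in at least `threshold` messages."""
    counts = Counter(bg for m in messages for bg in bigrams(m))
    return {bg for bg, c in counts.items() if c >= threshold}


def similarity(t1, t2, transactions):
    """Step 3 (sketch): pairwise co-occurrence counts between the features
    of t1 and t2, divided by the summed counts of all features involved."""

    def n(*features):
        # Number of transactions that contain every given feature.
        return sum(1 for t in transactions if all(f in t for f in features))

    numerator = sum(n(a, b) for a, b in product(t1, t2))
    denominator = sum(n(k) for k in set(t1) | set(t2))
    return numerator / denominator if denominator else 0.0
```

With four toy transactions `[{"k1","k2"}, {"k2","k4"}, {"k1","k2","k3"}, {"k4"}]`, the item sets `("k1","k2","k3")` and `("k2","k4")` give a numerator of 2 + 0 + 3 + 1 + 1 + 0 = 7 and a denominator of 2 + 3 + 1 + 2 = 8.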
The five major events identified from the streaming data are shown in Figure 2.
Figure 2: Image showing the events identified and the features of the events. The size of a circle represents the popularity of the event.
2. Event Timeline:
For each event 'E' we collected all the related microblog messages as described
above. We divided the time span (270 minutes) into 10-minute intervals and
plotted the number of event-related microblog messages in each interval against
time. From this graph, we identified the peaks using the TwitInfo peak
detection algorithm. Each peak in the graph represents a sub-event, which is
described by the set of most frequent unigrams and bigrams that cross a
threshold 'T2'; the duration of the sub-event is given by the time intervals
during which the event is at a peak. For each sub-event, we also selected as
its related microblog message the message containing the largest number of the
unigrams and bigrams that describe the sub-event. The complete set of
microblog messages describing all the sub-events is taken as evidence to
describe the timeline of the event.
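The binning and peak-marking steps can be sketched as follows. This is a simplified illustration in the spirit of a running mean and mean-deviation test; it does not reproduce the full TwitInfo procedure. The parameters `alpha`, `tau` and `warmup`, and the floor of 1.0 on the deviation, are assumptions chosen for the example.

```python
def bin_counts(timestamps_min, total_minutes=270, bin_size=10):
    """Count messages per interval; timestamps are minutes from stream start."""
    counts = [0] * (total_minutes // bin_size)
    for t in timestamps_min:
        idx = int(t // bin_size)
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts


def detect_peaks(counts, alpha=0.125, tau=2.0, warmup=3):
    """Return indices of bins whose count exceeds the running mean by more
    than `tau` running mean deviations (simplified sketch)."""
    if len(counts) <= warmup:
        return []
    # Initialise the running statistics from the first `warmup` bins.
    mean = sum(counts[:warmup]) / warmup
    meandev = sum(abs(c - mean) for c in counts[:warmup]) / warmup
    peaks = []
    for i in range(warmup, len(counts)):
        c = counts[i]
        # Floor the deviation at 1.0 so a flat history does not flag noise.
        if c - mean > tau * max(meandev, 1.0):
            peaks.append(i)
        # Exponentially weighted updates of deviation and mean.
        meandev = alpha * abs(c - mean) + (1 - alpha) * meandev
        mean = alpha * c + (1 - alpha) * mean
    return peaks
```

Each returned index corresponds to one 10-minute interval, so consecutive indices could be merged into a single sub-event interval.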
Figure 3: Image showing the timeline of the events, with their peaks marked in the graph of the number of event-related messages per unit time interval vs. time.
Figure 4: Image showing the timeline of all events with supporting descriptions from microblog data and call center data.
3. Event Duration:
For each event, we sort its sub-events by their start times. Let 'SE' denote
the sorted list of sub-events. The start time of the event is the time of the
first microblog message in the time interval of the first sub-event in 'SE',
and the end time of the event is the time of the last microblog message in the
time interval of the last sub-event in 'SE'.
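As a small illustration of this rule, the sketch below assumes each sub-event is given as a half-open `(start, end)` interval in minutes and that the event-related message times are available as a list. Both representations are assumptions made for the example.

```python
def event_duration(sub_events, message_times):
    """Return (start, end) of an event from its sub-event intervals.

    sub_events: list of (start, end) intervals, in any order.
    message_times: timestamps of the event-related messages.
    """
    if not sub_events:
        return None
    se = sorted(sub_events, key=lambda interval: interval[0])
    first, last = se[0], se[-1]
    in_first = [t for t in message_times if first[0] <= t < first[1]]
    in_last = [t for t in message_times if last[0] <= t < last[1]]
    if not in_first or not in_last:
        return None
    # First message of the first sub-event, last message of the last one.
    return min(in_first), max(in_last)
```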
Figure 5: Image showing the duration of the events.
4. Event Location:
The call center messages sent during the peak time intervals of an event are
considered the call center data related to that event. From this data, we
manually filtered out the messages that are not related to the event (e.g.
'TRAFFIC STOP'). The remaining locations were then plotted using QGIS. We
indexed these locations by 10-minute time intervals and kept only those
locations that lie in the vicinity of other locations in the preceding and
succeeding time intervals. For static events (e.g. "pok rally") we took only
the single location with the highest frequency in the final filtered list of
locations. For dynamic events (e.g. "black van") we took all the locations in
the final filtered list.
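The vicinity filter could be sketched as below. The sketch assumes planar `(x, y)` coordinates, Euclidean distance, a hypothetical `max_dist` radius, and that a location is kept when at least one location in an adjacent interval lies within that radius. None of these details are stated above, so they are illustrative choices only.

```python
import math
from collections import Counter


def filter_by_vicinity(locations_by_interval, max_dist):
    """Keep a point only if some point in the preceding or succeeding
    interval lies within `max_dist` of it.

    locations_by_interval: dict mapping interval index -> list of (x, y).
    """
    kept = {}
    for idx, points in locations_by_interval.items():
        neighbours = (locations_by_interval.get(idx - 1, [])
                      + locations_by_interval.get(idx + 1, []))
        kept[idx] = [p for p in points
                     if any(math.dist(p, q) <= max_dist for q in neighbours)]
    return kept


def static_event_location(points):
    """For a static event, return the most frequently reported location."""
    if not points:
        return None
    return Counter(points).most_common(1)[0][0]
```

For a dynamic event, all points remaining in the filtered dictionary would be plotted in interval order.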
Figure 6: Image showing the locations of the events. The colour of a location lies on a gradient from red to green in the order of the time at which the event took place. In addition, the label of each event is prefixed by a serial number that shows the same order. Hovering over a location gives its name. No location was identified for the event "suspects arrested".
5. Event Participants:
We used the CMU Twitter NLP part-of-speech tagging tool to tag the microblog
data for each event. From the tagged data we extracted all the longest runs of
consecutive tokens tagged 'NNP', which form the candidate set of participants.
From this set we removed the event locations identified above and selected the
participants through manual inspection. In addition, we used the Stanford
NER tool to extract named entities and added the phrases identified as
PERSON and ORGANIZATION to the list of participants.
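The extraction of the longest consecutive 'NNP' runs can be sketched as follows. The input is assumed to be a list of `(token, tag)` pairs already produced by a tagger; the function name and input format are hypothetical and no tagger is invoked here.

```python
def proper_noun_runs(tagged_tokens, tag="NNP"):
    """Return maximal runs of consecutive tokens carrying `tag`,
    each joined into a single candidate phrase."""
    runs, current = [], []
    for word, token_tag in tagged_tokens:
        if token_tag == tag:
            current.append(word)
        elif current:
            # A non-matching tag closes the current run.
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


def remove_locations(candidates, locations):
    """Drop candidate phrases that were already identified as locations."""
    return [c for c in candidates if c not in locations]
```

The remaining candidates would still need the manual inspection described above before being accepted as participants.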
Event | Participants
pok rally | POK leader Sylvia Marek, Dr. Audrey McConnell Newman award-winning activist, Lucio Jakab, Viktor-E, Abila police, POK community
dancing dolphin | Abila Fire Department, Abila Police Department
black van | Black van guys, gunmen, hostages
shots fired | Police officer, Abila Police SWAT team
suspects arrested | APD
Table 1: Table showing the events and the participants involved.
Figure 7: Interface to select an event and view its timeline with supporting microblog and call center data, along with the locations and participants of the event. The image above is a snapshot of the interface for the event "black van".
MC 3.3 - Select one of your five major
events from question MC 3.2 that you consider to be most likely to provide
additional clues to the investigation of the GAStech
disappearances. Describe the roles of the participants.
Describe how other events you identified in MC3.2 may have influenced your
selected event. Provide a hypothesis and evidence as to whom you suspect as
being directly involved in the GAStech
disappearances, either as perpetrators or victims. Please limit your
response to no more than five images and 500 words.
As per our hypothesis, POK is NOT involved in the GAStech
disappearances.
Of the five events identified, we pick the event "black van" as
the crucial event for providing further evidence about the GAStech disappearances.
Roles of the participants:
The gunmen and hostages in the black van serve to create terror among the
people. The hostages could also be members of POK or people who were present
at the rally.
Influence of other events:
As Figure 7 shows, the black van started at the location of the Dancing Dolphin
and ended very close to the location of the POK rally. It is suspicious that
both places are the locations of the other two major events, "pok rally" and
"dancing dolphin fire". A fire at the Dancing Dolphin during the POK rally
cannot be a coincidence. It could serve to distract from the POK rally and to
disperse the police to the other end of the city, which is evident from Scene 2
of Figure 7 and also from Figure 8. According to the timeline of the events,
the black van started moving after the evacuations had started at the Dancing
Dolphin, which allowed the van to move freely. The accident at Schaber Ave
could also have been intentional, so that the van would be pursued by the
police while creating terror among the people of Abila. The event "shots fired"
took place very close to the location of the POK rally and caused riots at the
rally, which had been peaceful until then. As POK members would have no motive
to cause riots at their own rally, the black van guys are not related to the
POK members. The suspicious black van could therefore have kidnapped POK
members, while the fire at the Dancing Dolphin provided a perfect way to cause
terror in the city. By this analogy, the black van could also be viewed as a
perpetrator behind the GAStech disappearances, as it fits the scenario well and
has the motive to do so. Evaluating the genuineness of the event "suspects
arrested" would give further clues.
Figure 7: Image illustrating the complete scenario of the events.
Figure 8: Image showing that the other events contributed to the distraction from the POK rally.